Feature Extraction and Clustering of Croatian News Sources
Abstract
This paper presents the design of a system for feature extraction and classification of news articles from Croatian news sources. It gives an overview of the supervised and unsupervised machine learning techniques most widely used for text classification and clustering. The paper discusses a number of issues particular to classifying news source material, from its collection and organization to problems in evaluating method correctness and categorization efficiency on Croatian news documents. Uses of these techniques are discussed, and their quantitative evaluation on a newly developed test corpus of news articles is proposed.
Similar resources
News Feature Extraction for Events on Social Network Platforms
Microblog-based social network platforms like Twitter and Sina Weibo have been important sources for news event extraction. However, existing works on microblog event extraction, which usually use keywords, entities, or selected microblogs to represent events, are not able to extract details of an event. Based on the view of news report, an event should present detailed news features, i.e., whe...
Supervised Feature Extraction of Face Images for Improvement of Recognition Accuracy
Dimensionality reduction methods transform or select a low-dimensional feature space to efficiently represent the original high-dimensional feature space of the data. Feature reduction is an important step in many pattern recognition problems across different fields, especially in the analysis of high-dimensional data. Hyperspectral images are acquired by remote sensors and human face images ar...
Structuring the Blogosphere on News from Traditional Media
News and social media are emerging as a dominant source of information for numerous applications. However, their vast unstructured content presents challenges to efficient extraction of such information. In this paper, we present the SYNC3 system that aims to intelligently structure content from both traditional news media and the blogosphere. To achieve this goal, SYNC3 incorporates innovative ...
Suffix Tree Based Chinese Document Feature Extraction and Clustering in RSS Aggregator
In an RSS aggregator, an important issue is how to make feed information more manageable for RSS subscribers. In this paper, we propose suffix tree based clustering of RSS feed documents in a Chinese RSS aggregator. We construct a suffix tree with meaningful Chinese words, and choose the phrases with a high score given by a formula as document features. We cluster documents using group-average algo...
Event Extraction from Heterogeneous News Sources
With the proliferation of news articles from thousands of different sources now available on the Web, summarization of such information is becoming increasingly important. Our research focuses on merging descriptions of news events from multiple sources, to provide a concise description that combines the information from each source. Specifically, we describe and evaluate methods for grouping s...